fix(BA-2404): Generalize gpu_allocated field with improved SlotName and ResourceSlot typing #6087

achimnol · 2025-09-27T14:46:02Z

Core changes:

Rewrite SlotName as a UserString subclass that lazily parses the slot name format.
- {device_name}.{major_type}[:{minor_type}]
Update the gpu_allocated calculation to sum all allocated accelerators using the SlotName.is_accelerator() method.
- This centralizes the logic for guessing whether a slot name refers to an accelerator, for future use.
Update ResourceSlot to use UserDict with proper generic type arguments.
- Since typeshed and the stdlib documentation indicate that initialization should accept a single dict instance rather than kwargs, update all such usages.
- For compatibility with the existing codebase, this PR uses str as the key type, but this should be updated to SlotName in future work (BA-2628; Migrate Manager API to Pydantic-based Request/Response Models #6173)
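
For readers unfamiliar with the approach, here is a minimal sketch of a lazily parsed UserString subclass for the `{device_name}.{major_type}[:{minor_type}]` format. The class name `SlotNameSketch`, the helper `sum_accelerators`, and the rule that only `cpu` and `mem` are non-accelerators are assumptions made for illustration; the actual implementation in this PR may differ.

```python
from collections import UserString
from collections.abc import Mapping
from decimal import Decimal
from typing import Optional

# Assumption: slots whose device name is "cpu" or "mem" are not accelerators.
_NON_ACCELERATOR_DEVICES = frozenset({"cpu", "mem"})


class SlotNameSketch(UserString):
    """Illustrative stand-in for SlotName (hypothetical, not the PR's class)."""

    def __init__(self, seq: object) -> None:
        super().__init__(seq)
        # Parsing is deferred until one of the parsed fields is first read.
        self._parsed = False
        self._device_name = ""
        self._major_type = ""
        self._minor_type: Optional[str] = None

    def _parse(self) -> None:
        if self._parsed:
            return
        device_name, _, type_part = self.data.partition(".")
        major_type, sep, minor_type = type_part.partition(":")
        self._device_name = device_name
        self._major_type = major_type
        self._minor_type = minor_type if sep else None
        self._parsed = True

    @property
    def device_name(self) -> str:
        self._parse()
        return self._device_name

    @property
    def major_type(self) -> str:
        self._parse()
        return self._major_type

    @property
    def minor_type(self) -> Optional[str]:
        self._parse()
        return self._minor_type

    def is_accelerator(self) -> bool:
        return self.device_name not in _NON_ACCELERATOR_DEVICES


def sum_accelerators(slots: Mapping[str, Decimal]) -> Decimal:
    """Sum the amounts of every slot that looks like an accelerator."""
    return sum(
        (amount for name, amount in slots.items()
         if SlotNameSketch(name).is_accelerator()),
        Decimal(0),
    )
```

Because UserString compares and hashes like the underlying string, such an object can still be compared against plain `str` keys while the codebase migrates.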

Warning

When multiple different accelerators are allocated in a single container or installed in a single agent node, this simple "sum" for gpu_allocated may yield a nonsensical value because it mixes different units and fraction scales. We need some design discussion here.
This is kept as a known issue because the previous implementation behaved the same way in this respect.
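
To make the concern concrete, here is a small worked example with hypothetical slot names and amounts (the `npu.device` slot and the `naive_gpu_allocated` helper are illustrative assumptions, not taken from the codebase):

```python
from decimal import Decimal

# Hypothetical allocation for one container; names and values are illustrative.
allocated = {
    "cpu": Decimal("4"),
    "mem": Decimal("8589934592"),
    "cuda.shares": Decimal("0.5"),  # a fraction of one GPU
    "npu.device": Decimal("2"),     # two whole devices of another accelerator
}


def naive_gpu_allocated(slots: dict) -> Decimal:
    # Assumption: everything except "cpu" and "mem" counts as an accelerator.
    return sum(
        (amount for name, amount in slots.items()
         if name.partition(".")[0] not in ("cpu", "mem")),
        Decimal(0),
    )


total = naive_gpu_allocated(allocated)
print(total)  # 2.5
```

The result 2.5 adds half a GPU share to two whole devices of a different kind, so it does not correspond to any single meaningful unit.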

Checklist: (if applicable)

Mention of the original issue
Test cases for:
- New SlotName parsing implementation

📚 Documentation preview 📚: https://sorna--6087.org.readthedocs.build/en/6087/

📚 Documentation preview 📚: https://sorna-ko--6087.org.readthedocs.build/ko/6087/

Copilot

Pull Request Overview

This PR refactors the SlotName type from a simple NewType to a UserString class with parsing capabilities, and updates ResourceSlot to properly handle string keys and normalize values. The changes improve type safety and consistency in GPU resource allocation tracking across different accelerator types (CUDA devices, shares, MIG variants, and NPUs).

Key Changes

Converted SlotName from NewType to a UserString class with lazy parsing and accelerator detection capabilities
Enhanced ResourceSlot to properly normalize keys to strings and process raw values (int, float, str, Decimal, BinarySize) into Decimal
Updated GPU allocation tracking to use generic accelerator detection instead of hardcoded CUDA-specific slot names
Added serialization support for SlotName and improved JSON encoding for test fixtures
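
As a rough illustration of the normalization described above, the following sketch shows a UserDict subclass that coerces keys to `str` and raw values to `Decimal`. The class name `ResourceSlotSketch` is hypothetical, and BinarySize handling (for values with size suffixes) is deliberately left out, so this is not the PR's implementation.

```python
from collections import UserDict
from decimal import Decimal
from typing import Union

RawValue = Union[int, float, str, Decimal]


class ResourceSlotSketch(UserDict[str, Decimal]):
    """Illustrative stand-in for ResourceSlot (hypothetical)."""

    @staticmethod
    def _normalize(value: RawValue) -> Decimal:
        if isinstance(value, Decimal):
            return value
        if isinstance(value, float):
            # Go through str() to avoid binary floating-point noise.
            return Decimal(str(value))
        return Decimal(value)

    def __setitem__(self, key: object, value: RawValue) -> None:
        # UserDict.__init__ and update() both route through __setitem__,
        # so every entry is normalized no matter how it is inserted.
        super().__setitem__(str(key), self._normalize(value))


# Initialize from a single dict instance rather than kwargs.
slots = ResourceSlotSketch({"cpu": 4, "mem": "1024", "cuda.shares": 0.5})
duplicate = slots.copy()
```

Here `slots.copy()` returns another `ResourceSlotSketch`, which is one reason to prefer it over dict unpacking when the mapping type should be preserved.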

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/ai/backend/common/types.py | Core refactoring: SlotName class implementation and ResourceSlot normalization improvements |
| src/ai/backend/common/msgpack.py | Added msgpack serialization support for SlotName |
| src/ai/backend/common/utils.py | Added debugging utility `pprint_with_type` for type inspection |
| src/ai/backend/common/resilience/policies/retry.py | Added default non-retryable exceptions to retry policy |
| src/ai/backend/manager/models/resource_usage.py | Updated GPU allocation to use SlotName.is_accelerator() |
| src/ai/backend/manager/repositories/user/repository.py | Updated GPU stats aggregation to sum all accelerator slots |
| src/ai/backend/manager/repositories/group/repository.py | Updated container stats to use generic accelerator detection |
| src/ai/backend/manager/repositories/resource_preset/repository.py | Added str() conversion for dict keys |
| src/ai/backend/manager/models/image.py | Added str() conversion for resource limit keys |
| src/ai/backend/manager/api/etcd.py | Added str() conversion for API response keys |
| src/ai/backend/manager/idle.py | Improved variable naming from `val` to `slot_val`/`util_val` |
| src/ai/backend/manager/sokovan/scheduler/selectors/concentrated.py | Added type hint for sort_key list |
| src/ai/backend/agent/resources.py | Updated to use ResourceSlot type and SlotName class |
| src/ai/backend/agent/alloc_map.py | Added str() conversions for fnmatch operations |
| src/ai/backend/agent/docker/agent.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/agent/dummy/agent.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/agent/kubernetes/agent.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/agent/stage/kernel_lifecycle/docker/resource.py | Changed from dict unpacking to slots.copy() |
| src/ai/backend/accelerator/cuda_open/plugin.py | Added str() conversion for slot_name in metadata |
| tests/manager/conftest.py | Enhanced fixture_json_encoder to handle ResourceSlot and Decimal |
| tests/manager/repositories/keypair_resource_policy/test_keypair_resource_policy.py | Fixed test data to use string values for resource slots |
| tests/common/test_types.py | Added comprehensive tests for SlotName parsing and BinarySize/Decimal conversion |
| changes/2404.fix.md | Added changelog entry |

HyeockJinKim · 2025-11-05T06:08:59Z

src/ai/backend/common/resilience/policies/retry.py

+DEFAULT_NON_RETRYABLE_ERRORS: Final = (
+    TypeError,
+    NameError,
+    AttributeError,
+)


Does this need to be provided publicly?
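
For context on how such a tuple is typically consumed, here is a minimal sketch of a retry helper that re-raises non-retryable errors immediately. The function `call_with_retry` and its parameters are hypothetical and are not the API of the resilience module in this repository.

```python
from typing import Callable, Final, Optional, Tuple, Type, TypeVar

T = TypeVar("T")

DEFAULT_NON_RETRYABLE_ERRORS: Final = (
    TypeError,
    NameError,
    AttributeError,
)


def call_with_retry(
    func: Callable[[], T],
    *,
    max_attempts: int = 3,
    non_retryable: Tuple[Type[BaseException], ...] = DEFAULT_NON_RETRYABLE_ERRORS,
) -> T:
    """Call func up to max_attempts times, skipping retries for logic bugs."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_exc: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return func()
        except non_retryable:
            # Logical bugs will not fix themselves; fail fast.
            raise
        except Exception as exc:
            # Possibly transient; remember it and try again.
            last_exc = exc
    assert last_exc is not None
    raise last_exc
```

If callers are expected to extend or override the tuple, as with the `non_retryable` parameter here, that would be one argument for keeping it public; otherwise a private name would also work.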

github-actions bot assigned achimnol Sep 27, 2025

github-actions bot added size:XL 500~ LoC comp:manager Related to Manager component comp:agent Related to Agent component comp:common Related to Common component size:L 100~500 LoC and removed size:XL 500~ LoC labels Sep 27, 2025

github-advanced-security bot found potential problems Sep 29, 2025

src/ai/backend/common/types.py: Fixed

github-actions bot added the area:docs Documentations label Sep 29, 2025

achimnol added 6 commits October 29, 2025 12:57

fix: Let gpu_allocated metric embrace all types of accelerators

4e0d231

feat: Support SlotName serialization in msgpack

4728605

fix: Add some non-retryable exceptions to the resilience util

68ad11e

TypeError, NameError, and AttributeError indicate logical bugs rather than transient failures. Note: SQLAlchemy's StatementError is also a subclass of TypeError.

feat: Add common.utils.pprint_with_type()

9ff31ed

fix: Minimize the scope of modifications for typed ResourceSlot

f801499

fix: minor typing

4aab698

achimnol force-pushed the fix/generalize-gpu-allocated-stat-field branch from dbb23e9 to 4aab698 Compare October 29, 2025 06:07

github-advanced-security bot found potential problems Oct 29, 2025

src/ai/backend/common/types.py: Dismissed

achimnol added 3 commits October 29, 2025 15:12

refactor: Resimplify

629bd85

docs: Add news fragment

773fea5

fix: test failure due to missing type conversion in fixture

0665b91

achimnol marked this pull request as ready for review October 29, 2025 08:38

achimnol requested review from HyeockJinKim and Copilot October 29, 2025 08:38

Copilot AI reviewed Oct 29, 2025

achimnol added 4 commits October 29, 2025 17:44

fix: Update legacy test code

ab51571

refactor: Simplify as suggested by Copilot

6b6cee6

fix: typo

efdc3a5

fix: Ensure init of slotted attributes in SlotName as suggested by Copilot

e7bd3fb

HyeockJinKim reviewed Nov 5, 2025
